Abstract

This manuscript shows that AdaBoost and 
its immediate variants can produce approxi- 
mate maximum margin classifiers simply by 
scaling step size choices with a fixed small 
constant. In this way, when the unscaled step
size is an optimal choice, these results pro- 
vide guarantees for Friedman's empirically 
successful "shrinkage" procedure for gradient 
boosting (Friedman, 2000). Guarantees are 
also provided for a variety of other step sizes, 
affirming the intuition that increasingly regu- 
larized line searches provide improved margin 
guarantees. The results hold for the exponen- 
tial loss and similar losses, most notably the 
logistic loss. 

1. Introduction 

AdaBoost and related boosting algorithms greedily ag- 
gregate many simple predictors into a single accurate 
predictor (Freund & Schapire, 1997). One explanation 
for the efficacy of boosting is that it not only seeks ag- 
gregates with low empirical risk, but moreover that it 
prefers good margins, which leads to improved generalization
(Schapire et al., 1997). Since AdaBoost does
not attain maximum margins on general instances, a 
push was made to develop methods which carry such a 
guarantee (Rätsch & Warmuth, 2005; Shalev-Shwartz
& Singer, 2008; Rudin et al., 2007).

This work shows that margin maximization may be 
achieved by scaling back the step size. The intuition 
for this result is simple (cf. Figure 1): when (equiv- 
alently) considered as steps in a coordinate descent 
procedure, the iterates, depicted as a path, approx- 
imate the path of constrained optima (for all possi- 
ble choices of constraint). By scaling back the step 
size, the optimal path is more finely approximated. 
As there have been many proposed step sizes for
these methods, this manuscript will study four sepa- 
rate choices, deriving improved bounds for the more 
regularized choices. While it has been shown be- 
fore that regularized step sizes have good generaliza- 
tion and asymptotically good margins (Zhang & Yu, 
2005), this manuscript shows that straightforward step
choices achieve these margins at rates matching explic- 
itly margin-maximizing boosting methods. 

This practice of scaling back weights was proposed by 
Friedman (2000, Section 5), who referred to it as a 
shrinkage scheme (Copas, 1983). This scheme is effec- 
tive, and adopted in practice (see for instance Bradski 
(2000, Class CvGBTrees) and Pedregosa et al. (2011, 
Class GradientBoostingClassifier)); the purpose
of this manuscript is to provide theoretical guarantees. 

1.1. Outline 

After summarizing the main content, this introduc- 
tion closes with connections to related work; there- 
after, Section 2 recalls the core algorithm, defines the 
class of loss functions, and provides the four step sizes. 



Figure 1. The blue diagonal line is the empirical risk minimizer
subject to varying $\ell_1$ constraints, and is also a maximum
margin choice. The green line takes optimal steps,
and grossly overshoots the optimal path. By applying mild
shrinkage, the red line approximates the maximum margin
choice much more finely. (Legend: no shrinkage, shrinkage, constrained.)
As boosting is generally studied under the weak learning
assumption (a separability condition), the dominant
study in this manuscript is also under the condition
of separability, and appears in Section 3. The
first step is to show that shrinkage does not drasti- 
cally change the rate of convergence of the empirical 
risk under these methods. The more involved study 
is on the topic of margins, and the final subsection 
compares these bounds to those of other methods. 

General (potentially nonseparable) instances are dis- 
cussed in Section 4. Once again, the first step is 
a convergence rate guarantee, which again matches 
those without shrinkage. This section also demon- 
strates that, under a certain decomposition of boosting 
problems, the algorithm is still achieving margins on 
a separable sub-component of the problem. 

The manuscript closes with some discussion in Sec- 
tion 5. All proofs are relegated to appendices (in the 
supplementary material).

1.2. Related Work 

Three close works proposed regularized line searches 
for boosting. First, Friedman (2000) gave the same 
scheme as is considered here (albeit with only the optimal
line search); follow-up work has been mainly empirical,
and the questions of convergence rates and
margin guarantees do not appear in the literature. 
Second, Zhang & Yu (2005) also considered regular- 
ized line searches, but with a goal of proving consis- 
tency; margin maximization is proved as a byproduct, 
and the analogous results here hold under fewer con- 
ditions, and come with rates for the more stringent 
step sizes. A third work, due to Rätsch et al. (2001),
also proves margin maximizing properties of regular- 
ized line searches, but again without rates. 

As mentioned in the introduction, margin maximiza- 
tion properties of AdaBoost have received extensive 
study; an excellent survey of results with pointers 
to other literature is provided by Schapire & Freund 
(2012, Chapter 5). Amongst these, a crucial result, 
due to Rudin et al. (2004), provides a concrete input 
to AdaBoost which yields suboptimal margins (which 
is used in Section 3.3); that work also studies the evo- 
lution of these margins as a dynamical system, a topic 
which will reappear in Section 5. 

The primary contribution of this manuscript is to ex- 
hibit margin maximization, thus a natural comparison 
is to other algorithms with this same guarantee, for instance
the works of Rätsch & Warmuth (2005), Shalev-Shwartz
& Singer (2008), and Rudin et al. (2007) (or
again refer to Schapire & Freund (2012, Chapter 5,
Bibliographic Notes) for a more extensive summary).
This manuscript will briefly compare with the meth- 
ods of Shalev-Shwartz & Singer (2008), which subsume 
some earlier results and match the best guarantees, 
along with giving a simple, general, greedy scheme. 
The key distinction between previous work and the
present work is, first, that the algorithmic modifications
here are minor (in particular, the form of unregularized
empirical risk minimization is unchanged), and second,
that properties of an existing, widely used method are
discerned (namely, the shrinkage procedure presented
by Friedman (2000)).

As is standard in the above works, this manuscript is 
only concerned with convergence of empirical quanti- 
ties. 

In order to prove convergence rates, this work relies 
heavily on techniques due to Telgarsky (2012). In 
particular, the scheme to prove convergence rates of 
empirical risk, detailed properties of splitting out a 
hard core from a boosting instance (cf. Section 4), 
and the notion of relative curvature (cf. Section 2.1) 
are all due to Telgarsky (2012). The intent of the 
present manuscript is to establish margin properties, 
and in this regard it departs from Telgarsky (2012); by 
contrast, the convergence rates of empirical risk pre- 
sented here are thus trivial, but included since they 
did not appear explicitly in the literature. It is worth 
mentioning that these methods produce bad constants 
when applied to the logistic loss; unfortunately, pre- 
vious work also suffers in this case (for instance, the 
work of Collins et al. (2002) provided only convergence 
of empirical risk, and not rates). 

2. Algorithms and Notation 

First some basic notation. Let $\{(x_i, y_i)\}_{i=1}^m \subseteq \mathcal{X} \times \{-1,+1\}$
denote an $m$-point sample. Take $\mathcal{H}_0$ to denote
the collection of weak learners; it is assumed that
$h \in \mathcal{H}_0$ satisfies $h(\mathcal{X}) \subseteq [-1,+1]$, and that $\mathcal{H}_0$ has
some form of bounded complexity, meaning specifically
that the set of vectors $\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}_0\}$
is finite; this for instance holds if there is a fixed finite
set of outputs from $\mathcal{H}_0$, e.g., each $h$ is binary. Consequently,
let $\mathcal{H} = \{h_j\}_{j=1}^n$ denote the effective finite set
of hypotheses, and collect the responses on the sample
into a matrix $A \in [-1,+1]^{m \times n}$ with $A_{ij} := -y_i h_j(x_i)$.

Boosting finds a weighting $\lambda \in \mathbb{R}^n$ of $\mathcal{H}$, which corresponds
to a regressor $x \mapsto \sum_{j=1}^n \lambda_j h_j(x)$, and thus a
binary classification rule after thresholding. The corresponding
($\ell_1$ minimum) margin $\mathcal{M}(A\lambda)$ over the sample
with respect to $\lambda$ is

$$\mathcal{M}(A\lambda) := \min_{i \in [m]} \frac{-e_i^\top A\lambda}{\|\lambda\|_1} = \frac{\min_{i \in [m]} -e_i^\top A\lambda}{\|\lambda\|_1}.$$

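Let $\gamma$ denote the best (largest) achievable margin;
equivalently (Shalev-Shwartz & Singer, 2008), $\gamma$ is the
weak learning rate (which justifies the choice of $\ell_1$ margins):

$$\gamma := \max_{\substack{\lambda \in \mathbb{R}^n \\ \|\lambda\|_1 = 1}} \mathcal{M}(A\lambda)
= \max_{\substack{\lambda \in \mathbb{R}^n \\ \|\lambda\|_1 = 1}} \min_{i \in [m]} -e_i^\top A\lambda
= \min_{w \in \Delta_m} \max_{j \in [n]} \left| \sum_{i=1}^m w_i y_i h_j(x_i) \right|
= \min_{w \in \Delta_m} \|A^\top w\|_\infty.$$

When $\gamma > 0$, the instance is considered separable; classically,
this condition is termed the weak learning assumption
(Kearns & Valiant, 1989; Freund & Schapire,
1997).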
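
As a concrete illustration of these definitions, the following minimal Python sketch evaluates the $\ell_1$ margin of a weighting and the dual quantity $\|A^\top w\|_\infty$. The 3-by-3 matrix is a hypothetical toy instance (each weak learner errs on exactly one example) and is not taken from the text; the helper names `margin` and `dual_value` are likewise illustrative.

```python
def margin(A, lam):
    """l1-normalized minimum margin: min_i (-(A lam)_i) / ||lam||_1."""
    norm1 = sum(abs(x) for x in lam)
    if norm1 == 0:
        raise ValueError("the margin is undefined for the zero weighting")
    scores = [-sum(a_ij * l_j for a_ij, l_j in zip(row, lam)) for row in A]
    return min(scores) / norm1


def dual_value(A, w):
    """||A^T w||_inf for a distribution w over the m examples."""
    m, n = len(A), len(A[0])
    return max(abs(sum(w[i] * A[i][j] for i in range(m))) for j in range(n))


# Hypothetical toy instance: A[i][j] = -y_i * h_j(x_i), so -1 marks a correct vote.
A = [[-1, 1, -1],
     [-1, -1, 1],
     [1, -1, -1]]

uniform_margin = margin(A, [1.0, 1.0, 1.0])          # all learners weighted equally
uniform_dual = dual_value(A, [1 / 3, 1 / 3, 1 / 3])  # all examples weighted equally
single_margin = margin(A, [1.0, 0.0, 0.0])           # a single learner misclassifies one example
```

On this toy instance the uniform weighting over learners and the uniform distribution over examples give the same value, so the chain of equalities defining $\gamma$ is tight there; a single weak learner, by contrast, has a negative margin.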

2.1. The Family of Loss Functions 

The class $\mathbb{L}$ will effectively be "functions similar to the
exponential loss". Some of this is for analytic convenience,
but some of this appears to be essential, and
thus a bit of motivation is appropriate. 

Optimization problems typically take advantage of 
curvature (e.g., strong convexity) to establish a con- 
vergence rate. The analysis here instead uses a relative 
form of curvature: it suffices for, say, the Hessian to 
not be too small relative to the gap between the cur- 
rent primal objective value and the primal optimum. 
In this sense, the exponential loss is ideal, as it is a 
fixed point of the differentiation operator. 

Definition 2.1. Given a loss $\ell : \mathbb{R} \to \mathbb{R}_{++}$ (where
$\mathbb{R}_{++}$ denotes positive reals), let $C_\ell(z) \ge 1$ (with potentially
$C_\ell(z) = \infty$) be the tightest positive constant
so that, for every $x \le z$: $C_\ell(z)^{-1} \le \exp(x)/\ell^{(i)}(x) \le C_\ell(z)$
for $i \in \{0,1,2\}$ (the zeroth, first, and second
derivatives).

Since $C_\ell(z)$ is defined to be the tightest constant, it
follows that $y \le z$ implies $C_\ell(y) \le C_\ell(z)$.

From here, the class of loss functions may be defined.

Definition 2.2. Let $\mathbb{L}$ contain all functions $\ell : \mathbb{R} \to \mathbb{R}_{++}$
which are twice continuously differentiable, strictly
convex, and have $C_\ell(z) < \infty$ for all $z \in \mathbb{R}$. Additionally,
if $\lim_{z \to -\infty} C_\ell(z) = 1$, then $\ell \in \mathbb{L}_\infty$.

Crucially, the two classes $\mathbb{L}$ and $\mathbb{L}_\infty$ both contain the
exponential and logistic losses.

Proposition 2.3. $\{x \mapsto \exp(x),\ x \mapsto \ln(1 + \exp(x))\} \subseteq \mathbb{L}_\infty$.
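
To make the constant of Definition 2.1 tangible for the logistic loss, the sketch below estimates it numerically. It is only an illustration under stated assumptions: the supremum over all $x \le z$ is approximated on a finite grid below $z$ (the grid spacing and length are arbitrary choices), and the function names are hypothetical.

```python
import math


def logistic_ratios(x):
    """exp(x) / l^(i)(x) for i = 0, 1, 2, where l(x) = ln(1 + exp(x))."""
    e = math.exp(x)
    # l(x) = log1p(e), l'(x) = e / (1 + e), l''(x) = e / (1 + e)^2
    return (e / math.log1p(e), 1.0 + e, (1.0 + e) ** 2)


def c_logistic(z, grid_step=0.1, grid_points=400):
    """Grid estimate of the tightest constant C with 1/C <= ratio <= C for x <= z.

    Assumes z is moderate (roughly z > -600) so that exp(x) does not underflow
    to zero on the grid.
    """
    worst = 1.0
    for k in range(grid_points):
        x = z - k * grid_step
        for r in logistic_ratios(x):
            worst = max(worst, r, 1.0 / r)
    return worst


c_far_left = c_logistic(-20.0)  # very close to 1
c_at_zero = c_logistic(0.0)     # driven by the second derivative at x = 0
```

The estimate shrinks toward 1 as $z$ decreases, matching the limiting behavior required for membership in $\mathbb{L}_\infty$, while at moderate $z$ the constant is noticeably larger than 1.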



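Algorithm 1 BOOST.

Input: loss $\ell$, matrix $A \in [-1,+1]^{m \times n}$.

Output: Weighting sequence $\{\lambda_t\}_{t=0}^{\infty}$.

Initialize $\lambda_0 := 0$.

for $t = 1, 2, \ldots$ do

    Choose column (weak learner)
    $j_t := \arg\max_j |\nabla\mathcal{L}(A\lambda_{t-1})^\top A e_j|$.

    Set descent direction $v_t \in \{\pm e_{j_t}\}$, whereby
    $\nabla\mathcal{L}(A\lambda_{t-1})^\top A v_t = -\|\nabla\mathcal{L}(A\lambda_{t-1})^\top A\|_\infty$.

    Find $\alpha_t$ via line search.
    Update $\lambda_t := \lambda_{t-1} + \alpha_t v_t$.
end for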
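
The loop of Algorithm 1 can be sketched in a few lines of Python. The version below is a minimal illustration, not the manuscript's implementation: it fixes the exponential loss, and for the line search it assumes the shrunken AdaBoost step $\frac{\nu}{2}\ln\frac{1+\gamma_t}{1-\gamma_t}$ (the standard AdaBoost step scaled by $\nu$, with $\gamma_t$ the normalized correlation of the chosen column). The toy matrix and all function names are hypothetical.

```python
import math


def margin(A, lam):
    """l1-normalized minimum margin of the weighting lam."""
    norm1 = sum(abs(x) for x in lam)
    return min(-sum(a * l for a, l in zip(row, lam)) for row in A) / norm1


def boost_exp(A, nu, rounds):
    """Coordinate descent on the exponential loss with shrinkage nu in (0, 1]."""
    m, n = len(A), len(A[0])
    lam = [0.0] * n
    for _ in range(rounds):
        z = [sum(A[i][j] * lam[j] for j in range(n)) for i in range(m)]
        grad = [math.exp(z_i) / m for z_i in z]  # gradient of L at A lam
        corr = [sum(grad[i] * A[i][j] for i in range(m)) for j in range(n)]
        j = max(range(n), key=lambda k: abs(corr[k]))  # most correlated column
        sign = -1.0 if corr[j] > 0 else 1.0            # v_t = sign * e_j is a descent direction
        gamma_t = abs(corr[j]) / sum(grad)
        if gamma_t >= 1.0:
            break  # degenerate case: the step below is undefined, so stop
        alpha = 0.5 * nu * math.log((1.0 + gamma_t) / (1.0 - gamma_t))
        lam[j] += sign * alpha
    return lam


# Hypothetical toy instance in which every weak learner errs on exactly one example.
A = [[-1, 1, -1],
     [-1, -1, 1],
     [1, -1, -1]]

lam = boost_exp(A, nu=0.1, rounds=2000)
achieved = margin(A, lam)  # approaches the best margin, 1/3, on this instance
```

Each round recomputes the gradient from scratch for clarity; a practical implementation would update the example weights incrementally.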



One way to interpret this is to say "in the limit, logistic 
loss is the same as exponential loss". Unfortunately, 
this treatment of the logistic loss ends up being quite 
unfair, in the sense that the bounds are not accurately 
representative of the behavior of the algorithm (see 
Section 3.3). It is, however, unclear how to better 
deal with the logistic loss. 

Lastly, the relevant primal objective function may be 
defined. 

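Definition 2.4. Given $\ell \in \mathbb{L}$ and vector $z \in \mathbb{R}^m$, define
$\mathcal{L}(z) := m^{-1} \sum_{i=1}^m \ell(z_i)$, whereby the primal optimization
problem for boosting is to minimize $\mathcal{L}(A\lambda)$
over the domain $\mathbb{R}^n$. For convenience, define
$\bar{\mathcal{L}}_A := \inf_{\lambda \in \mathbb{R}^n} \mathcal{L}(A\lambda)$.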

2.2. Algorithm 

The algorithm appears in Algorithm 1. Before defining 
the various step sizes, two more definitions are in order. 
Definition 2.5. For every $t$, define
$\gamma_t := \|A^\top \nabla\mathcal{L}(A\lambda_{t-1})\|_\infty / \|\nabla\mathcal{L}(A\lambda_{t-1})\|_1$.
(Note that $1 \ge \gamma_t \ge \gamma$.)

Additionally, rather than depending on parameter
$C_\ell(z)$ for a carefully chosen $z$, the following definition
suffices.

Definition 2.6. For $t \ge 1$, define
$C_t := C_\ell(\ell^{-1}(m\mathcal{L}(A\lambda_{t-1})))$.


The significance of $C_t$ is as follows. Since the algorithm
itself is coordinate descent, and moreover
since every line search will be shown to guarantee descent,
every candidate $\lambda$ considered in round $t$ will
satisfy $\mathcal{L}(A\lambda) \le \mathcal{L}(A\lambda_{t-1})$; thus, for every $i \in [m]$,
$\ell(e_i^\top A\lambda) \le m\mathcal{L}(A\lambda) \le m\mathcal{L}(A\lambda_{t-1})$, and so
$e_i^\top A\lambda \le \ell^{-1}(m\mathcal{L}(A\lambda_{t-1}))$, where the inverse is well-defined
since $\ell$ is a bijection between $\mathbb{R}$ and $\mathbb{R}_{++}$ by definition
of $\mathbb{L}$ (otherwise $C_\ell(z) = \infty$).

The step sizes considered here are as follows,
in order of least to most aggressive. Throughout
these step sizes, $\nu \in (0, 1]$ will denote a shrinkage
parameter.

Quadratic upper bound. Rather than performing
an optimal line search, i.e., rather than minimizing
$\alpha \mapsto \mathcal{L}(A(\lambda_{t-1} + \alpha v_t))$, a quadratic upper
bound of this univariate function may be minimized,
which has a closed form solution (cf. the
proof of Lemma 3.2). In particular, define the step
size a^(^) := vj t /Cf. This choice is pleasant al- 
gorithmically only when Ct is easy to compute 
(for instance, Ct = 1 for the exponential loss). In 
general, however, it is useful as an analytic aid, 
since most step sizes here can be lower bounded 
by it. This step size was introduced by Telgarsky
(2012, Appendix D.3). 

Wolfe. The Wolfe line search is a standard tool from
nonlinear optimization (Nocedal & Wright, 2006,
Chapter 3), and for convex problems it may be
implemented with binary search (Telgarsky, 2012,
Appendix D.1). More precisely, this choice is a
set of step sizes $\alpha_t^W(\nu)$ satisfying two conditions.
First, the step is explicitly disallowed from being
too large:

$$\mathcal{L}(A(\lambda_{t-1} + \alpha v_t)) \le \mathcal{L}(A\lambda_{t-1}) - \alpha(1 - \nu/2)\|A^\top \nabla\mathcal{L}(A\lambda_{t-1})\|_\infty. \qquad (2.7)$$

Second, the step should be approximately optimal
(in terms of the line search problem):

$$\nabla\mathcal{L}(A(\lambda_{t-1} + \alpha v_t))^\top A v_t \ge -(1 - \nu/4)\|\nabla\mathcal{L}(A\lambda_{t-1})^\top A\|_\infty. \qquad (2.8)$$

(Requiring the reverse inequality (with the right
hand side negated) yields the Strong Wolfe Conditions,
which are not necessary here.) In contrast
to $\alpha_t^Q(\nu)$, the Wolfe step does not require
knowledge of $C_t$, but will yield nearly identical
bounds; in fact, computation of the Wolfe step
requires only function evaluations, gradient evaluations,
and knowledge of $\nu$, $A$, $v_t$, $\lambda_{t-1}$.

AdaBoost. Following the scheme of AdaBoost, define
$\alpha_t^A(\nu) := \frac{\nu}{2} \ln\left(\frac{1+\gamma_t}{1-\gamma_t}\right)$, where convention is
followed and $\gamma_t = 1$ is ignored. Unfortunately,
even though $\gamma_t$ is loss-dependent, this step will
only yield rates with the exponential loss. However,
it will be instrumental in analyzing the fully
optimizing step size, presented next. This step
size was introduced with the original presentation
of AdaBoost (Freund & Schapire, 1997), though
the analysis here will rather follow a slightly later
treatment (Schapire & Singer, 1999).

Optimal. Let $\alpha_t^O(1)$ be a minimizer to $\alpha \mapsto \mathcal{L}(A(\lambda_{t-1} + \alpha v_t))$,
which, as in the case of $\alpha_t^A(\nu)$,
is assumed to exist. For $\nu \in (0,1)$, set
$\alpha_t^O(\nu) := \nu\,\alpha_t^O(1)$. When $A$ is binary and $\ell = \exp$,
$\alpha_t^O(\nu) = \alpha_t^A(\nu)$, though in general this is not true. This
step size (with shrinkage!) was suggested by
Friedman (2000) for use with the logistic loss.
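
The Wolfe step can be found by bisection, since for a convex line-search objective condition (2.7) fails only for steps that are too long and condition (2.8) fails only for steps that are too short. The sketch below is a minimal illustration for the exponential loss, written in terms of the current scores `z` (the entries of $A\lambda_{t-1}$) and the direction `a` (the entries of $Av_t$); these argument names, the bracketing strategy, and the toy inputs are assumptions of this example rather than details from the text.

```python
import math


def phi(z, a, alpha):
    """Empirical exponential risk along the line: (1/m) sum_i exp(z_i + alpha a_i)."""
    return sum(math.exp(z_i + alpha * a_i) for z_i, a_i in zip(z, a)) / len(z)


def dphi(z, a, alpha):
    """Derivative of phi with respect to alpha."""
    return sum(a_i * math.exp(z_i + alpha * a_i) for z_i, a_i in zip(z, a)) / len(z)


def wolfe_step(z, a, nu, max_iter=200):
    """Return a step satisfying (2.7) and (2.8) for shrinkage nu in (0, 1]."""
    g = -dphi(z, a, 0.0)  # equals ||A^T grad L||_inf when a = A v_t
    if g <= 0.0:
        raise ValueError("a must be a descent direction")
    f0 = phi(z, a, 0.0)
    lo, hi, alpha = 0.0, None, 1.0
    for _ in range(max_iter):
        if phi(z, a, alpha) > f0 - alpha * (1.0 - nu / 2.0) * g:
            hi = alpha  # (2.7) violated: step too long
        elif dphi(z, a, alpha) < -(1.0 - nu / 4.0) * g:
            lo = alpha  # (2.8) violated: step too short
        else:
            return alpha
        alpha = 2.0 * lo if hi is None else 0.5 * (lo + hi)
    raise RuntimeError("bisection did not terminate")


# Hypothetical inputs: zero initial scores and a binary direction erring on one example.
z0 = [0.0, 0.0, 0.0]
a0 = [-1.0, -1.0, 1.0]
step = wolfe_step(z0, a0, nu=0.5)
```

Only function and derivative evaluations along the line are needed, in line with the remark that the Wolfe step does not require knowledge of $C_t$.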

To close, note that $\alpha_t^Q(\nu)$ and $\alpha_t^O(\nu)$ have a simple
relationship.

Proposition 2.9. If $A \in [-1,+1]^{m \times n}$ and $\ell \in \mathbb{L}$,
then $\alpha_t^Q(\nu) \le \alpha_t^O(\nu)$.
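
Proposition 2.9 can be checked numerically in a simple case. The sketch below is restricted to the exponential loss, where $C_t = 1$ and the quadratic upper bound step therefore reduces to $\nu\gamma_t$; the optimal step is found by bisection on the derivative of the line-search objective. The inputs `z` (entries of $A\lambda_{t-1}$) and `a` (entries of $Av_t$, assumed to be a descent direction with at least one positive entry) are hypothetical, as are the function names.

```python
import math


def line_derivative(z, a, alpha):
    """d/d(alpha) of (1/m) sum_i exp(z_i + alpha a_i)."""
    return sum(a_i * math.exp(z_i + alpha * a_i) for z_i, a_i in zip(z, a)) / len(z)


def gamma_t(z, a):
    """|grad^T a| / ||grad||_1 for the exponential loss, with a = A v_t."""
    weights = [math.exp(z_i) for z_i in z]  # proportional to the gradient entries
    return abs(sum(w * a_i for w, a_i in zip(weights, a))) / sum(weights)


def quadratic_step(z, a, nu):
    """Shrunken quadratic upper bound step, specialized to exp (C_t = 1)."""
    return nu * gamma_t(z, a)


def optimal_step(z, a, nu, tol=1e-12):
    """nu times the exact line-search minimizer, located by bisection."""
    lo, hi = 0.0, 1.0
    while line_derivative(z, a, hi) < 0.0:  # grow the bracket until the slope turns up
        hi *= 2.0
        if hi > 1e6:
            raise ValueError("no minimizer found; does a have a positive entry?")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if line_derivative(z, a, mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return nu * lo


z0 = [0.0, 0.0, 0.0]
a_real = [-1.0, -0.5, 0.5]    # a non-binary direction with entries in [-1, +1]
a_binary = [-1.0, -1.0, 1.0]  # a binary direction
alpha_q = quadratic_step(z0, a_real, nu=0.5)
alpha_o = optimal_step(z0, a_real, nu=0.5)
```

For the binary direction, the unshrunk optimal step coincides with the AdaBoost step $\frac{1}{2}\ln\frac{1+\gamma_t}{1-\gamma_t}$, consistent with the remark in the description of the optimal step size.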

3. The Separable Case 

This section considers the setting of separability, 
meaning the weak learning assumption is satisfied 
($\gamma > 0$). The three subsections respectively provide
convergence rates in empirical risk, basic margin guar- 
antees, and close with some discussion. 

3.1. Convergence of Empirical Risk 

The basic guarantee is that all of these line search
methods, for any loss in $\mathbb{L}$ and with arbitrary shrinkage,
exhibit the same basic convergence rate as AdaBoost.

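Theorem 3.1. Let boosting matrix $A$ with corresponding
$\gamma > 0$ and shrinkage parameter $\nu \in (0, 1]$ be given.
Given any $\ell \in \mathbb{L}$, any $\epsilon > 0$, and iterates $\{\lambda_t\}_{t \ge 0}$
consistent with $\alpha_t^Q(\nu)$, $\alpha_t^W(\nu)$, $\alpha_t^O(\nu)$, or $\alpha_t^A(\nu)$ with
$\ell = \exp$, then $O\left(\frac{1}{\gamma^2}\ln\left(\frac{1}{\epsilon}\right)\right)$ iterations suffice to ensure
$\mathcal{L}(A\lambda_t) \le \epsilon$, where the $O(\cdot)$ suppresses terms depending
on $C_1$ and $\nu$.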

The proof is in the appendix, but a basic discussion 
will appear here for each step size. The proofs are 
straightforward, as they should be: convergence anal- 
yses typically prove a bound for one step, and then 
iterate the bound. As such, taking $1/\nu$ steps which
are a $\nu$-factor as long as the original should do at least
as well as the original (which is indeed the exhibited 
trade-off). 

First is the quadratic upper bound, which implicitly 
gives an upper bound for the optimal step as well. The 
proof follows a standard scheme from convex optimization
of lower and upper bounding a potential function
based on the gradient; the specifics use the relative
curvature properties of $\mathbb{L}$, and follow the analysis of
Telgarsky (2012, Section 6.1, Appendix D). 

Lemma 3.2. Consider the setting of Theorem 3.1, but
with each step size $\alpha_t$ satisfying $\alpha_t^Q(\nu) \le \alpha_t \le \alpha_t^O(\nu)$.
Then for any $t \ge t_0 \ge 0$,



C(AX t ) <£(AX t0 )exp 



i/(2 



2C L 




The reason for the parameter $t_0$ is to mitigate the
horrendous dependence on $C_t$, which is potentially
very large. In particular, consider $\ell \in \mathbb{L}_\infty$, meaning
$\lim_{z \to -\infty} C_\ell(z) = 1$. $C_1$ may be quite bad, but
convergence still happens. It follows that $C_t \to 1$,
and thus, by choosing some large $t_0$, the bound provides
that perhaps there is an initially slow convergence
phase, but eventually it is very fast. That is
to say, Lemma 3.2 may be applied multiple times to
give a more refined picture of the convergence, particularly
in the case that $\ell \in \mathbb{L}_\infty$, which guarantees the
constants are eventually near 1.

Next, the Wolfe step size has a similar guarantee (and
the analysis once again heavily relies on techniques due
to Telgarsky (2012, Section 6.1, Appendix D)).

Lemma 3.3. Consider the setting of Theorem 3.1, but
with $\alpha_t \in \alpha_t^W(\nu)$. Then for any $t \ge t_0 \ge 0$,



C(AXt) < £(AA t0 )exp 



i/(2 



'o+l i=t +l 




(The denominator blows up by a factor 4 due to extra 
halves introduced into the Wolfe conditions, specifi- 
cally to adjust around the natural Wolfe parameters 
being within (0, 1) and not (0, 1].) 

Lastly consider $\alpha_t^A(\nu)$. As in the statement of Theorem 3.1,
this step size is only shown to work with
the exponential loss. This may be an artifact of the 
analysis, however, which perhaps follows too closely 
the treatment of Schapire & Singer (1999), which only 
considers the exponential loss; for instance, a slightly 
modified step size can be used to show convergence 
with the logistic loss (Collins et al., 2002). 

Lemma 3.4. Consider the setting of Theorem 3.1, but
with $\alpha_t = \alpha_t^A(\nu)$. Then for any $t \ge t_0 \ge 0$,



C(AXt) < C(AX t 



, n c i 

i=t a + l 



('-?0 



3.2. Margin Maximization 

The margin rates here follow a simple pattern: the 
more regularized the step size, the faster the conver- 
gence to a good margin. While no lower bounds are 
presented, this is an interesting and intuitive corre- 
spondence (in particular, consistent with Figure 1). 
Unfortunately, the unconstrained step sizes only have 
asymptotic convergence (no rates), so the umbrella 
theorem for this subsection is also asymptotic. 

Theorem 3.5. Let boosting matrix $A$ with corresponding
$\gamma > 0$ and shrinkage parameter $\nu \in (0, 1]$ be given.
Given any $\ell \in \mathbb{L}_\infty$, any $\epsilon > 0$, and iterates $\{\lambda_t\}_{t \ge 0}$
consistent with $\alpha_t^Q(\nu)$, $\alpha_t^W(\nu)$, $\alpha_t^A(\nu)$ with $\ell = \exp$, or
$\alpha_t^O(\nu)$ with binary $A \in \{-1,+1\}^{m \times n}$, then there exists
$T$ so that $\mathcal{M}(A\lambda_t) \ge \gamma - \epsilon$ for all $t \ge T$.

In contrast with the convergence rates of empirical risk 
(e.g., Theorem 3.1), the condition $\ell \in \mathbb{L}_\infty$ is made,
rather than simply $\ell \in \mathbb{L}$ (with improved constants
when $\ell \in \mathbb{L}_\infty$). This can be interpreted to say: the
analysis depends heavily upon the structure of the ex- 
ponential loss. While this condition is likely unneces- 
sary, on the other extreme it is important for the loss 
to be strictly convex; if for instance the hinge loss is 
used, then minimization can stop at any point achiev- 
ing zero error, in particular at one with poor margin 
properties. 

Returning to task, the quadratic upper bound comes 
first. 

Lemma 3.6. Suppose the setting of Theorem 3.5, but
with $\alpha_t = \alpha_t^Q(\nu)$. Additionally let $t \ge t_0 \ge 0$ be given



with t > 



2G\ ln(m) 



7 2 i/(2- 

tive by Lemma 3.2) 



(whereby all margins are nonnega- 
Then 



M{AX t 



> 



( 2 



\ 2C t +i 



where 



co :=max< 1, mCt +iC(AX to ) exp 



ln(c ) 
tvj 



v(2 - v) 
. 2C t+i 



E^ 2 

i=i / 



To interpret this bound, first consider the simplifying
case that $\ell = \exp$, whereby $C_t = 1$ for all $t$. Additionally
taking $t_0 = 0$, it follows that $c_0 = m$, and the
bound is simply

$$\mathcal{M}(A\lambda_t) \ge \gamma\left(1 - \frac{\nu}{2} - \frac{\ln(m)}{t\nu\gamma^2}\right);$$

in particular, $\mathcal{M}(A\lambda_t) \to \gamma$ as $\nu \to 0$ and $t\nu \to \infty$. For
some other $\ell \in \mathbb{L}_\infty$, the $C_{t_0+1}$ term in the denominator also
presents an obstacle to establishing margin maximization;
but note that $t_0 \to \infty$ suffices, since it combines
with $\ell \in \mathbb{L}_\infty$ via Theorem 3.1 to grant $C_t \to 1$.

The proof of Lemma 3.6 does not have to work too 
hard, as the step size appears prominently in the con- 
vergence rate bound (cf. Lemma 3.2). As will be dis- 
cussed in Section 3.3, the rate is nearly ideal. 

The Wolfe search exhibits a similar rate. 

Lemma 3.7. Suppose the setting of Theorem 3.5, but
with $\alpha_t = \alpha_t^W(\nu)$. Additionally let $t \ge t_0 \ge 0$ be given

with t > ^^pz^j (whereby all margins are nonnega- 
tive by Lemma 3.3). Then 



( 2-v 



M{AX t )> 1 



to + l 



4Ci m(c ) 



where 



c :=max|l,mC , t 0+ i£(AA to )exp f ^2~^ a ^j- 

The preceding two step choices, $\alpha_t^Q(\nu)$ and $\alpha_t^W(\nu)$, had
explicit regularization: the first stops as soon as the
steepest matching quadratic turns upward, and the
second refuses to go beyond a boundary (cf. eq. (2.7)).

On the other hand, the choices $\alpha_t^A(\nu)$ and $\alpha_t^O(\nu)$ are
only constrained by the data. Recall that one way to
derive $\alpha_t^A(\nu)$ is in the case of binary $A \in \{-1,+1\}^{m \times n}$
and $\ell = \exp$, where it is crucial that each weak learner
is wrong on at least one example: this prevents steps 
from being too large. The techniques in the following 
proof follow those used in the margin bounds for reg- 
ular AdaBoost (and are asymptotic there as well). It 
is worth noting that not only is this bound the worst, 
but the analysis is the trickiest. 

Lemma 3.8. Consider the setting of Theorem 3.5, but
now $\ell = \exp$ and $\alpha_t = \alpha_t^A(\nu)$. Then for any $\epsilon \in (0, \gamma]$,
there exists $T$ so that $\mathcal{M}(A\lambda_t) \ge \gamma - \epsilon$ for all $t \ge T$.

Similarly, $\alpha_t^O(\nu)$ is only implicitly regularized. The
condition that $A \in \{-1,+1\}^{m \times n}$ prevents the negative,
constraining examples from having too little influence.

Lemma 3.9. Consider the setting of Theorem 3.5, but
now $\ell = \exp$, the matrix $A$ is binary, and $\alpha_t = \alpha_t^O(\nu)$.
Then for any $\epsilon > 0$, there exists $T$ so that $\mathcal{M}(A\lambda_t) \ge \gamma - \epsilon$
for all $t \ge T$.

The above lemmas together provide the proof of Theo- 
rem 3.5. But before closing, note that while the results 
for the unconstrained step sizes were only asymptotic, 
it is possible to derive a rate for the more modest goal 
of margins closer to $\gamma/3$.



Proposition 3.10. Consider the setting of Theorem 3.5,
but specialized with $\ell = \exp$ and $\alpha_t = \alpha_t^A(\nu)$.
Let a target margin value $\theta < \gamma$ be given. If $\theta < \gamma/(1 + \gamma)$
(e.g., it suffices that $\theta < \gamma/2$), then



m * — ' 



-ejAX t 
IMI 



< 



< 



exp 



-ii/(7 2 -6*7(2 + 7)) 



In particular, if $\theta < \gamma/(2 + \gamma)$ (e.g., it suffices that
$\theta < \gamma/3$) and $t > 2\ln(m)/(\nu(\gamma^2 - \theta\gamma(2 + \gamma)))$, then
$\mathcal{M}(A\lambda_t) > \theta$.

Note, of course, that this bound has the severe analytic 
artifact of demonstrating no benefit of shrinkage! 

3.3. Discussion 

To get a sense of these margin bounds, first recall Freund's
lower bound on boosting methods in the separable
case, which states that $\Omega\left(\frac{1}{\gamma^2}\ln\left(\frac{1}{r}\right)\right)$ iterations are
necessary to achieve classification error $r > 0$ (Freund,
1995, Section 2). Setting $r = 1/m$, it follows
that $\Omega(\ln(m)/\gamma^2)$ iterations are necessary to achieve
any nonnegative margin. By comparison, with $\alpha_t^W(\nu)$
and $\ell = \exp$, just $12\ln(m)/\gamma^2$ iterations with choice
$\nu = 1/2$ suffice to reach margin $\gamma/2$ (by Lemma 3.7).
More generally, $\alpha_t^W(\nu)$ reaches margin $\gamma(1 - \nu)$ with
$8\ln(m)/(\nu\gamma)^2$ iterations (if step size $\alpha_t^Q(\nu)$ is used,
then $2\ln(m)/(\nu\gamma)^2$ iterations suffice by Lemma 3.6).

The explicit margin-maximizing method of Shalev-Shwartz
& Singer (2008) requires $t \ge 32\ln(m)/\epsilon^2$ iterations
to achieve margin $\gamma - \epsilon$, where $\epsilon \in (0, \gamma)$.
By comparison, converting the above multiplicative
bound into an additive bound, step size $\alpha_t^W(\epsilon/\gamma)$
requires $8\ln(m)/\epsilon^2$ iterations. While this bound
is slightly better, the comparison is not fair, since
$\alpha_t^W(\epsilon/\gamma)$ requires knowledge of $\gamma$ in the choice of
shrinkage parameter $\nu$. (Pessimistically taking $\nu = \epsilon$
gives an additive guarantee, but with a poor rate.)
Consequently, it can be reasoned that shrinkage methods
achieve excellent margins, but are best suited for
multiplicative guarantees.
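
The iteration counts quoted in this comparison are easy to tabulate. The short sketch below evaluates them for one hypothetical problem size; the function names and the particular values of $m$, $\gamma$, and $\epsilon$ are illustrative assumptions, and the formulas are simply the counts stated above.

```python
import math


def quadratic_iterations(m, gamma, nu):
    """Rounds quoted for the quadratic upper bound step to reach margin gamma * (1 - nu)."""
    return 2.0 * math.log(m) / (nu * gamma) ** 2


def wolfe_iterations(m, gamma, nu):
    """Rounds quoted for the Wolfe step to reach margin gamma * (1 - nu)."""
    return 8.0 * math.log(m) / (nu * gamma) ** 2


def explicit_iterations(m, eps):
    """Rounds quoted for the explicit margin-maximizing method to reach gamma - eps."""
    return 32.0 * math.log(m) / eps ** 2


# Hypothetical instance: 1000 examples, best margin 0.1, additive target error 0.05.
m, gamma, eps = 1000, 0.1, 0.05
nu = eps / gamma  # shrinkage turning the multiplicative bound into gamma - eps

rounds_quadratic = quadratic_iterations(m, gamma, nu)
rounds_wolfe = wolfe_iterations(m, gamma, nu)
rounds_explicit = explicit_iterations(m, eps)
```

With $\nu = \epsilon/\gamma$ the three counts differ only by constant factors (2, 8, and 32 in front of $\ln(m)/\epsilon^2$), which is the sense in which the shrinkage bounds match the explicit method, subject to the caveat that choosing $\nu$ this way requires knowing $\gamma$.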

Another question is how accurately the bounds pre- 
sented here depict the methods provided. As a brief 
sanity check, the methods may be run on a prob- 
lem instance where AdaBoost demonstrably does not 
achieve maximum margins. The particular instance 
tested here is a binary matrix $A \in \{-1,+1\}^{8 \times 8}$ due to
Rudin et al. (2004, Theorem 7); recall that AdaBoost,
in the present notation (with $A$ binary), corresponds
to $\ell = \exp$ and step size $\alpha_t^A(1) = \alpha_t^O(1)$ (no shrinkage).
Two plots are provided. 

1. Figure 2 is a sanity check, showing that $\ell = \exp$
and $\alpha_t^A(1) = \alpha_t^O(1)$ may not achieve maximum
margins, but shrinkage overcomes this.

2. Figure 3 demonstrates that the Wolfe search (with
$\ell = \exp$) is indeed effective, but demanding higher
accuracy comes at a price.

Figure 2. Sanity check: shrinkage leads to margin maximization.

Figure 3. Sanity check: the Wolfe search effectively maximizes margins.

These plots will be discussed further in Section 5. Ad- 
ditional tests with this matrix demonstrated that the 
method of Shalev-Shwartz & Singer (2008) indeed per- 
forms a tiny bit worse than the Wolfe search, but of 
course one example is not terribly indicative. Perhaps 
most importantly, a test with the logistic loss showed 
that the bound is loose: the logistic loss performs well, 
and does not suffer a startup cost as indicated by the 
bounds. 

4. The General Case 

The last technical contribution of this manuscript is to 
briefly consider the general case (which is potentially 
nonseparable). Similarly to the separable case, this 
section will establish convergence rates for empirical 
risk, margin guarantees, and briefly discuss the con- 
nection to existing margin maximizing methods. But 
first, it is necessary to discuss the structure of the gen- 
eral case, and in particular to develop what margins 
mean without separability. 

This section hinges upon the following decomposition
of a boosting instance. This decomposition partitions
a boosting instance, specifically its examples
$\{(x_i, y_i)\}_{i=1}^m$, into a hard subset $H(A)$, and an easy
subset $H(A)^c$. The easy subset alone is separable, and
thus margins will be measured there. Although the
analysis will rely heavily on properties of this decomposition
due to Telgarsky (2012), the decomposition
itself has appeared, with various guarantees, in numerous
places (Goldreich & Levin, 1989; Impagliazzo,
1995; Mukherjee et al., 2011). The notation $H(A)$ reflects
the fact that this structure has no relation to the
choice of $\ell \in \mathbb{L}$.

Definition 4.1. (Cf. Telgarsky (2012, Definition 5.1,
5.7).) Given a boosting problem encoded in a matrix
$A \in \mathbb{R}^{m \times n}$, a set of examples (rows) $H(A) \subseteq [m]$
is a hard core for $A$ (and the corresponding boosting
problem) if it satisfies the following properties.

• There exists a weighting $\lambda \in \mathbb{R}^n$ with $e_i^\top A\lambda < 0$
for $i \in H(A)^c$ and $e_i^\top A\lambda = 0$ for $i \in H(A)$.

• Every weighting $\lambda \in \mathbb{R}^n$ with $e_i^\top A\lambda < 0$ for some
$i \in H(A)$ also has $e_k^\top A\lambda > 0$ for some $k \in H(A)$.

Additionally, define a row-wise partition of $A$ into matrices
$A_0, A_+$, where $A_+$ has the examples in $H(A)$,
and $A_0$ has the examples in $H(A)^c$.

The second property provides that $H(A)$ is difficult:
positive margins on some examples force negative margins
on others. On the other hand, the complement
$H(A)^c$ is easy, and moreover can be solved without
affecting $H(A)$.

Proposition 4.2. (Cf. Telgarsky (2012, Proposition
5.8, Theorem 5.9).) For any $A \in \mathbb{R}^{m \times n}$, a hard core
$H(A)$ always exists, and is unique.

With the decomposition in place, the aforementioned 
guarantees may be stated. The first, as in the sepa- 
rable case, is convergence of empirical risk. There is 
hardly anything to do here; the groundwork from Section 3
can be plugged directly into existing techniques
to generate this theorem (Telgarsky, 2012, Section 6).

Theorem 4.3. Let general boosting matrix $A$ be given
(i.e., potentially $\gamma = 0$), along with shrinkage parameter
$\nu \in (0, 1]$, any $\ell \in \mathbb{L}$, and target suboptimality
$\epsilon > 0$. Suppose step sizes $\{\alpha_t\}_{t \ge 0}$ are consistent with
$\alpha_t^Q(\nu)$, $\alpha_t^W(\nu)$, $\alpha_t^O(\nu)$, or $\alpha_t^A(\nu)$ with $\ell = \exp$ and $A$
binary. Then $O(1/\epsilon)$ iterations suffice to reach suboptimality
$\epsilon > 0$.

If the instance is either separable (i.e., $\gamma > 0$ as in Section 3)
or attains its minimizer (i.e., $|H(A)| = m$ (Telgarsky,
2012, Theorem 5.5)), then the rate improves to
$O(\ln(1/\epsilon))$.

Lastly come the margin guarantees. As stated above,
$H(A)^c$, considered alone, is separable; note furthermore
that the definition of hard core provides the existence
of a weighting $\bar\lambda$ which has positive margins
over $H(A)^c$, but abstains entirely over $H(A)$. Consequently,
an approximate minimizer to $\mathcal{L}(A\,\cdot)$ can always
add in a scaling of $\bar\lambda$ and improve its empirical
risk while simultaneously improving margins over
$H(A)^c$. It is thus natural to expect the methods
here to achieve positive margins over $H(A)^c$. Note
that the following result only shows that some positive
margins are attained; it neither asserts any sense
in which they are maximal, nor provides
rates.

Theorem 4.4. Let general boosting matrix $A$ be given
with $1 \le |H(A)| \le m - 1$ (i.e., the problem is neither
separable, nor is the minimizer attainable). Let
shrinkage parameter $\nu \in (0, 1]$ and any $\ell \in \mathbb{L}_\infty$ be
given. Suppose step sizes $\{\alpha_t\}_{t \ge 0}$ are consistent with
$\alpha_t^Q(\nu)$, $\alpha_t^W(\nu)$, $\alpha_t^A(\nu)$ with $\ell = \exp$ and binary $A$, or
$\alpha_t^O(\nu)$ with $\ell = \exp$ and binary $A$. Then there exists
$\bar\gamma > 0$ so that every example off the hard core (i.e.,
$i \in H(A)^c$) has margin at least $\bar\gamma$ for all large $t$.

To close, consider once again the comparison to ex- 
plicit margin maximizing boosting methods as pre- 
sented by Shalev-Shwartz & Singer (2008). There is 
no point in discussing the specific method discussed in 
Section 3.3, whose optimal objective value is exactly
$\gamma$, which in this case is zero, and the method may happily
quit without iterating. Indeed, a primary contribution
of Shalev-Shwartz & Singer (2008) is not only
to address this issue, but show how the same general 
boosting scheme can be instantiated for the aforemen- 
tioned method, as well as methods with tolerance to 
nonseparability. 

Indeed, consider the "soft-margin" boosting method
(Shalev-Shwartz & Singer, 2008), originally due to
Warmuth et al. (2006), which, roughly speaking, has
a parameter controlling how many examples to give
up on. This is in contrast to the methods here, which
not only have a fixed data-dependent structure they
try less hard on (the hard core $H(A)$), but moreover
the particular margins achieved over the hard core are
determined by the loss function $\ell \in \mathbb{L}$. It is of course
worth mentioning that the margin analysis in the non- 
separable case here is by comparison very incomplete, 
providing no rates and not even identifying exactly 
what positive margins are attained. 

5. Discussion 

This manuscript immediately raises a number of ques- 
tions. Perhaps foremost is the general question of the 
impact of margins on the efficacy of boosting. Al- 
though margins certainly provide an intuitive theory, 
it is still unclear how much they directly correlate with 
good algorithms (Reyzin & Schapire, 2006).

Next, the bounds for the logistic loss are not tight. As 
there do not appear to be any more forgiving analy- 
ses of the logistic loss, the natural question is whether 
there are new techniques which provide a better char- 
acterization. 

Lastly, Figure 2 shows a threshold effect: shrinkage 1 does not lead to the right margin, but 1/2 and smaller suffice to reach the maximum margin. (Indeed, experimentation reveals the threshold to be roughly 0.92.) It should be possible to clarify this behavior from the perspective of dynamical systems: smaller steps dodge bad attractors (Rudin et al., 2004; 2007).
A. Deferred Material from Section 2

Proof of Proposition 2.3. There is nothing to show for $\exp$, so consider $\ell(x) = \ln(1 + \exp(x))$, let $z \in \mathbb{R}$ be given, and let $x \le z$ be arbitrary.

Concavity grants $\ln(1 + \exp(x)) \le \exp(x)$. The lower bound can be checked in two stages. First, if $x \le \min\{-1, z\}$, a Taylor expansion gives

$$\ln(1 + e^x) \ge e^x - \sup_{c \in \mathbb{R}} \frac{e^{2x}}{2(1 + e^c)^2} \ge e^x \left(1 - \frac{\min\{e^z, e^{-1}\}}{2}\right).$$

On the other hand, if $-1 < x \le z$, then $e^x \le e^z \ln(1 + e^{-1}) / \ln(1 + e^{-1}) \le e^z \ln(1 + e^x) / \ln(1 + e^{-1})$.

Next, $\ell'(x) = e^x / (1 + e^x)$, so $\ell'(x) \le e^x \le \ell'(x)(1 + e^z)$. Similarly, $\ell''(x) = e^x / (1 + e^x)^2$, so $\ell''(x) \le e^x \le \ell''(x)(1 + e^z)^2$. □



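The following lemma (and its proof) derive $\alpha_t^{Q}(\nu)$, establish $\alpha_t^{Q}(\nu) \le \alpha_t^{O}(\nu)$, and give the basic improvement due to one step satisfying $\alpha \in [\alpha_t^{Q}(\nu), \alpha_t^{O}(\nu)]$.

Lemma A.1. Let boosting matrix $A$, shrinkage parameter $\nu \in (0, 1]$, and any $\ell \in \mathbb{L}$ be given. For any iteration $t$, it holds that $\alpha_{t+1}^{Q}(\nu) \le \alpha_{t+1}^{O}(\nu)$. Furthermore, any step $\alpha \in [\alpha_{t+1}^{Q}(\nu), \alpha_{t+1}^{O}(\nu)]$ satisfies

$$\mathcal{L}(A(\lambda_t + \alpha v_{t+1})) \le \mathcal{L}(A\lambda_t) \exp\left(-\frac{\nu(2-\nu)\gamma_{t+1}^2}{2C_{t+1}^6}\right).$$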

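Proof. This analysis follows a scheme laid out by Telgarsky (2012, Appendix D.3). Let $t$ denote any fixed iteration, and $I$ denote the (possibly unbounded) interval

$$I := \{\alpha \ge 0 : \mathcal{L}(A(\lambda_t + \alpha v_{t+1})) \le \mathcal{L}(A\lambda_t)\};$$

by continuity of $\mathcal{L}$ and choice of $v_{t+1}$, $I$ is nonempty, with nonempty interior. By second order Taylor expansion, every $\alpha \in I$ satisfies

$$
\begin{aligned}
\mathcal{L}(A(\lambda_t + \alpha v_{t+1}))
&\le \mathcal{L}(A\lambda_t) + \alpha v_{t+1}^\top A^\top \nabla\mathcal{L}(A\lambda_t) + \sup_{r \in I} \frac{\alpha^2}{2} v_{t+1}^\top A^\top \nabla^2\mathcal{L}(A(\lambda_t + r v_{t+1})) A v_{t+1} \\
&\le \mathcal{L}(A\lambda_t) - \alpha \|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty + \frac{\alpha^2}{2} \sup_{r \in I} \frac{1}{m} \sum_{i=1}^m \ell''(e_i^\top A(\lambda_t + r v_{t+1})) (e_i^\top A v_{t+1})^2 \\
&\le \mathcal{L}(A\lambda_t) - \alpha \|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty + \frac{C_{t+1}^2 \alpha^2}{2} \sup_{r \in I} \frac{1}{m} \sum_{i=1}^m \ell(e_i^\top A(\lambda_t + r v_{t+1})) \\
&= \mathcal{L}(A\lambda_t) - \alpha \|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty + \frac{C_{t+1}^2 \alpha^2}{2} \mathcal{L}(A\lambda_t) \\
&\le \mathcal{L}(A\lambda_t) - \alpha \|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty + \frac{C_{t+1}^4 \alpha^2}{2} \|\nabla\mathcal{L}(A\lambda_t)\|_1,
\end{aligned}
$$

which made use of $\ell'' \le C_{t+1} \exp \le C_{t+1}^2 \ell$ along $I$, $\ell \le C_{t+1} \exp \le C_{t+1}^2 \ell'$ along $I$, $|A_{ij}| \le 1$ (since elements of $\mathcal{H}$ are bounded in this way), and the definition of $I$ (specifically $r = 0$ is the worst choice for $r \in I$). This final expression is a quadratic, whose minimizer must lie within $I$ (since its second derivative exceeds that of $\mathcal{L}$ along this interval). Differentiating and setting to zero, the minimizer is

$$\frac{\|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty}{C_{t+1}^4 \|\nabla\mathcal{L}(A\lambda_t)\|_1} = \frac{\gamma_{t+1}}{C_{t+1}^4} = \alpha_{t+1}^{Q}(1).$$

This provides a derivation of the step $\alpha_{t+1}^{Q}(\nu)$, and also shows $\alpha_{t+1}^{Q}(1) \le \alpha_{t+1}^{O}(1)$. Plugging $\alpha_{t+1}^{Q}(\nu)$ in for $\alpha$ in the above quadratic upper bound,

$$
\begin{aligned}
\mathcal{L}(A(\lambda_t + \alpha v_{t+1}))
&\le \mathcal{L}(A\lambda_t) - \frac{\nu(2-\nu)\gamma_{t+1}^2 \|\nabla\mathcal{L}(A\lambda_t)\|_1}{2C_{t+1}^4} \\
&\le \mathcal{L}(A\lambda_t) \left(1 - \frac{\nu(2-\nu)\gamma_{t+1}^2}{2C_{t+1}^6}\right) \\
&\le \mathcal{L}(A\lambda_t) \exp\left(-\frac{\nu(2-\nu)\gamma_{t+1}^2}{2C_{t+1}^6}\right).
\end{aligned}
$$

□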



Proof of Proposition 2.9. This is the first part of Lemma A.1. □
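
As a numerical sanity check (not part of the original analysis), the one-step guarantee of Lemma A.1 can be tested in the simplest setting: $\ell = \exp$, so that the constant $C_t$ is taken to be 1, a binary matrix $A$, and the step $\alpha = \nu\gamma_{t+1}$. The following is a minimal sketch under those assumptions; the function names, the random matrix, and the tolerance are illustrative choices and do not come from the paper.

```python
import math
import random


def exp_risk(A, lam):
    """Empirical exponential risk (1/m) * sum_i exp(e_i^T A lam)."""
    return sum(math.exp(sum(a * l for a, l in zip(row, lam))) for row in A) / len(A)


def shrunk_quadratic_step(A, lam, nu):
    """One coordinate step of size nu * gamma (exp loss, so C is taken as 1)."""
    m, n = len(A), len(lam)
    # Per-example gradient weights of the risk at A @ lam.
    grad = [math.exp(sum(a * l for a, l in zip(row, lam))) / m for row in A]
    # Correlation of each column of A with the gradient weights.
    corr = [sum(A[i][j] * grad[i] for i in range(m)) for j in range(n)]
    j = max(range(n), key=lambda k: abs(corr[k]))
    gamma = abs(corr[j]) / sum(grad)
    direction = -1.0 if corr[j] > 0 else 1.0  # move against the gradient
    new_lam = list(lam)
    new_lam[j] += direction * nu * gamma
    return new_lam, gamma


def check_decrease(m=30, n=8, nu=0.5, steps=25, seed=0):
    """Return True if every step meets the exp(-nu (2 - nu) gamma^2 / 2) factor."""
    rng = random.Random(seed)
    A = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(m)]
    lam = [0.0] * n
    for _ in range(steps):
        before = exp_risk(A, lam)
        lam, gamma = shrunk_quadratic_step(A, lam, nu)
        bound = before * math.exp(-nu * (2.0 - nu) * gamma ** 2 / 2.0)
        if exp_risk(A, lam) > bound * (1.0 + 1e-12):
            return False
    return True
```

Calling `check_decrease()` with different values of `nu` in $(0, 1]$ exercises the bound along a short run of coordinate descent; for general $\ell \in \mathbb{L}$ the constants $C_{t+1}$ would have to be tracked as well.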

B. Deferred Material from Section 3 
B.1. Deferred Material from Section 3.1

Proof of Lemma 3.2. By Lemma A.1, for any $t$,

$$\mathcal{L}(A(\lambda_t + \alpha v_{t+1})) \le \mathcal{L}(A\lambda_t) \exp\left(-\frac{\nu(2-\nu)\gamma_{t+1}^2}{2C_{t+1}^6}\right).$$

Now let $t_0 \le t$ be given as in the desired statement, apply this bound $t - t_0$ times, and use the fact that $C_{t+1} \le C_t$. □



Proof of Lemma 3.3. Let $t$ denote any fixed iteration. Substituting $c_1 = 1 - \nu/2$, $c_2 = 1 - \nu/4$, and $\eta = C_{t+1}^2$ in a nearly identical guarantee for the Wolfe line search (Telgarsky, 2012, Proposition D.6) (where $\eta$ is simply the biggest ratio between $\ell$ and $\ell''$ in the current sublevel set) provides



C(A(X t + avt+i)) 



< C(AX t ) 

< C(AX t ) ( 1 



2C? +1 £(AX t ) 



< C{A\ t ) exp - 



y{2 - vH +x 
8C t+i 
"(2 - ^)7 t 2 +i 



Given $t_0 \le t$, applying this bound $t - t_0$ times and using $C_{t+1} \le C_t$ gives the result. □



W = J2 i wi < mCf +1 £(AX t ) ■ By convexity of exp(-), 
C(AX t+1 ) 

<^±i^exp(e7AA t+1 ) 



< 



Chi fW\^^ _{l + ejAv t+1 



^ Wi exp 
1 - ejAv t+ i 



Ott+l 



{-ott+i) 



< 



/// \ 2 



/ n , 1 + 7t+l / \ 

exp(a t+1 ) H exp(-a t+1 ) 



2 



£(AA t )(l-7 t 2 + i) 1 



((l- 7t+1 )i- + (l + 7t+1 )i-) 



To simplify this expression, note that $(\cdot)^{1-\nu}$ is a concave function, and thus



(l-Tt+i) 1 ^ , (l+7t+i) ] 



< 



1 - It+i , 1 + 7h 



Next, instead of directly proving Lemma 3.4, a more 
general lemma is given first, which will be useful later. 

Lemma B.1. Consider the setting of Theorem 3.1, except now each step size $\alpha_i$ satisfies



To finish, given $t \ge t_0$, the result follows by $t - t_0$ applications of these bounds. □



$$\alpha_i^{A}(\nu) - \tau \le \alpha_i \le \alpha_i^{A}(\nu) + \tau$$

for some $\tau \ge 0$. Then, given $t \ge t_0$,



Proof of Lemma 3.4. This follows by taking the second bound in Lemma B.1 with the choice $\tau = 0$. □



C(AX t+1 ) 
< £{AX t0 ) 



H ~ ^r /2 ((l + l*) 1 - + (1 - 7,) 1 ^) 



i=t + l 



t+1 e r Cf 



i=to+l 



Proof of Theorem 3.1. The result follows from Lemma 3.2, Lemma 3.3, and Lemma 3.4 with the choice $t_0 = 0$ and using $C_{t+1} \le C_t$. □

B.2. Deferred Material from Section 3.2 

Proof of Lemma 3.6. To start, note that 



Proof. Fix an iteration t, and set Wi — £'(e- AXt) and 



IIAmll 



t+i 

v^vtfiCr 4 ' 

i=l 



t+1 



t+1 
By the form of $\mathcal{L}$ and the optimization guarantee in Lemma 3.2,

max exp(ej AX t +i) 
fce[m] 



Additionally, note 
IIAmlli 



t+i 
t=i 



t+i 



< ^exp(e^A m ) 

i=l 

< mC t+1 C{AX t+1 ) 

< m C t0+1 C(A)^)ew(-&^- £ 7 2 

\ to+1 i=i +l / 



raC to+1 £(^4A to )exp 



i/(2 - i/1 



to 

/O to + 1 i=1 



7i 



where Co is as in the statement. Since ln(-) is increas- 
ing, it follows that 

max e[A\ t+1 < ^ "Ti + ln ( c °)- 



Using the above bound on ||At||i, since t-f < J2t=i lu 
and —e^AXt+i is nonnegative by the lower bound on 



mm — n — > nun — — t+i 



fc£[ro] ||A t+ i||i fe£[m] i/^iiiT. 



>-,-( 2 -. U \ ln(C0) 



2 — 2/ \ ln(cg) 



> 7 i 



2c t 6 0+1 ; (t+ih 



□ 



Proof of Lemma 3. 7. Any step size ott+i satisfying the 
Wolfe conditions will have lower bound 



at+i > 



> 



(l-(l- V /A))\\A T C(A\ t )\\ 

CfC(AX t ) 
(l-(l-v/4))\\A T £(AX t )\\ 
C t 4 ||V£(^A t )||i 

4c? +1 - 4cr 



indeed this expression appears in proofs demonstrat- 
ing the improvement due to a single step of the Wolfe 
search, see for instance Telgarsky (2012, Proof of 
Proposition D.6, second to last line). 



Direct from the first Wolfe condition (eq. (2.7)), 
C(AX t+1 ) 

= C(A(X t + a t+1 v t+1 )) 

< C(AX t ) - a t+1 (l - i//2)||A T V£(A\ t )|| 00 

a t+1 {l-u/2)\\A T VC(A\ t )\\ oc 
C(AX t ) 



< C(AX t ) 1 - 



< L(AX t ) (l- Qm( 2 2 ^ )7t+1 

Now let t > to be given as in the statement. Applying 
the above inequality t — to times, 

max exp(e^AA f+ i) 

ke [ml 



< mC t+1 C{AX t+1 ) 



t+i 



<mC t0+1 £(AA t0 )exp( £ 



< mC to+ i£(AA to )exp 



(2 - y)i 
, 2C l+i — , 



• exp 



(2-^)7 

2^0 + 1 



t+1 



(2 - i/)7 



t+1 



to + 1 j=l 

where Co is as in the statement. Since ln(-) is increas- 
ing, it follows that 



(2 ^ )7 X> + ln(co). 



max e fe AX t+1 < 

fc£[mj ZU, 



to + 1 j=1 

Using the above lower bound on a* in terms of ji, and 
since all margins are nonnegative by the lower bound 
on t, 



-ejAXt+i 



fc£[m] ||A t+ i| 



> min 



-ejAXt+i 

t+! „ 



ken e;Ii 



>7 



>7 



2-1/ \ _ ln(cp) 

2C, t 2 o+i/ 

2- i/ \ _ 4fff ln(co) 

2^; J _ ft+ih 



□ 



The remainder of this subsection provides proofs for $\alpha_t^{A}(\nu)$ and $\alpha_t^{O}(\nu)$, but uses some later material, most specifically the quantity $\Upsilon_\nu$.
Lemma B.2. Consider the setting of Theorem 3.1, except now each step size $\alpha_t$ satisfies

$$\alpha_t^{A}(\nu) - \tau \le \alpha_t \le \alpha_t^{A}(\nu) + \tau$$

for some $\tau \ge 0$. Let $\theta \in [0, \gamma)$ be given. Then, given $t \ge t_0$,



the vaguely simpler bound 



i=i 



< 



|At+i||i 

<mC t+1 expiOWXMCiAXt,) J] 



4+1/ /-, . N 6v/2 



i=io+l 



l-7i 



•ejAA t+ i 



< 



-Willi 

<m<7 t+1 exp(0||Aj|i)£(AA to ) J] 



■ e-Cf (1 - t?)"/ 3 



□ 



t+l / /1 , N 0J//2 



»=io+l 



1-7, 



r a 4 



Proof of Lemma 3.8. Set $\theta := \gamma - \epsilon$, whereby $\theta \in [0, \gamma)$. Invoking Lemma B.2 and simplifying terms via $t_0 = 0$, $C_i = 1$, and $\tau = 0$, then for any $t$,



(l-7 2 r /2 ((l + 7 l ) i ^ + (l-7) i ^) 



i=*o+l 



<mC t+l ex V {e\\K\\i)£{AK) II 

■ e ^(i- 7 |r /2 V 



t+i / /1 . N 

flr / 1 + 7< ^ 



Ei 



1-7, 



-e a ^A t+ i 



< 



— "' 1 j 



i=l 



1 + 7, 
1-7, 



8v/2 



■ i(i- 7i 2 r /2 ((i+7,) 1 ^ + (i-7,) 1 ^) 



Proof. To start, 



— "' n 



Ei 



-ejAA t+ i 



< 



i=l 



1 + 7 
1^7 



01//2 



Willi 

m 

= ^l[0||A 4+1 || 1 +e i AA t+1 >0] 
»=i 

< mC t+ iexp(0||A t+ i||i)£(AA t+1 ). 



Next, note 

l|At+i||i = 



t+i 

A to + ^ ctiVi 

i=t + l 

<l|A to lli+ E C^ + ^M)- 

i=*o+l 

Combining these facts with the convergence bound 
from Lemma B.l, 



Ei 



-ejA Xt+i 
l|A tH 



< 



llli 



t+l 



<mC t+1 exp^HA^IIx^AAtJ [] le 

i=t + l \ 

e T Cf 



n T I 1 + 7, 
1-7, 



8v/2 



.i(i- 7 Y/2((i +7 )i- + ( i- 7 )i-)j ) 



where the replacement of $\gamma_i$ by $\gamma$ made use of the first part of Lemma B.6. Now, by the second part of Lemma B.6, this inner term is less than 1 iff $\theta < \Upsilon_\nu(\gamma)$. By Theorem B.5, since $\theta < \gamma$, there exists a $\nu$ sufficiently small that $\Upsilon_\nu(\gamma) > \theta$. Consequently, there exists a $T$ so that this product is less than $1/m$ whenever $t \ge T$, and the result follows. □

Proof of Lemma 3.9. Set $\theta := \gamma - \epsilon$, whereby $\theta \in [0, \gamma)$. Since $\ell \in \mathbb{L}_\infty$, choose $t_0$ large enough so that



r 8 < f 1 + 7 



By Lemma B.7, it follows that the optimal step size 
satisfies 

a£(v)-T<a£(u)<a£(,u) + T 

with r = 5 ln(C 4 ). Combining this with the bound on 

Ct above, 



-(l-7 2 r /2 ((l + 7,) 1 ^ + (l-7,) 1 ^) 



^ = Cl < 



1 + 7 
1^7 



(7-9)" 



As in the proof of Lemma B.1, $(\cdot)^{1-\nu}$ is concave, so the 1/2 may be pushed inside this last term to give



Plugging this into the general margin bound in Lemma B.2 and additionally replacing $\gamma_i$ with $\gamma$ thanks to the first part of Lemma B.6, and finally setting $\theta' := \theta + (\gamma - \theta)/2 = (\theta + \gamma)/2 < \gamma$,



-ejAX t +i 
IIAmlli 



< 



<mC t+ iexp(0||A to ||i)£(A\ to ) 



t+i / 

n e 

»=*o+l V 



8v/2 



1-7, 

(l-7f r /2 (( i + 7 . ) i- + ( i_ 7 .)i- ) 



<mC m exp(0||Aj|i)£(A\to) 
'1 + 7 S 

i=*o+l 



n 1 



7 



•i(l-7 2 r /2 ((l + 7) 1 ^ + (l-7) 1 ^) 
^mQ+xexp^llAtJIxKCAAtJ 



t+i 

n 

i=to+l 



1 + 7 
1-7 



e'v/2 



.I(1-7Y/ 2 ((1 + 7) 1 - I/ + (1-7) W )J. 

By the second part of Lemma B.6, the term within the 
product is less than one, and thus for all large t, this 
entire bound is less than 1, which gives the result. □ 

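Proof of Proposition 3.10. To start, note that, for any $t \ge 0$,

$$
\begin{aligned}
(1-\gamma_t)^{1-\theta}(1+\gamma_t)^{1+\theta}
&= (1-\gamma_t^2)^{1-\theta}(1+\gamma_t)^{2\theta} \\
&\le \exp\left(-\gamma_t^2(1-\theta) + \gamma_t(2\theta)\right) \\
&= \exp\left(-\gamma_t^2 + \theta\gamma_t(2+\gamma_t)\right). \qquad \text{(B.3)}
\end{aligned}
$$

Next, since

$$\frac{\gamma}{1+\gamma} = \frac{1}{1+1/\gamma} \le \frac{1}{1+1/\gamma_t} = \frac{\gamma_t}{1+\gamma_t},$$

then $\theta \le \gamma/(1+\gamma)$ implies

$$\frac{d}{d\gamma_t}\left(-\gamma_t^2 + \theta\gamma_t(2+\gamma_t)\right) = -2\gamma_t + 2\theta(1+\gamma_t) \le -2\gamma_t + 2\gamma_t = 0.$$

In particular, the expression $-\gamma_t^2 + \theta\gamma_t(2+\gamma_t)$ is decreasing in $\gamma_t$, and thus $\gamma_t \ge \gamma$ implies

$$-\gamma_t^2 + \theta\gamma_t(2+\gamma_t) \le -\gamma^2 + \theta\gamma(2+\gamma),$$

and consequently, combined with the bound in (B.3),

$$(1-\gamma_t)^{1-\theta}(1+\gamma_t)^{1+\theta} \le \exp\left(-\gamma^2 + \theta\gamma(2+\gamma)\right).$$

Plugging this into the simplified generic bound in Lemma B.2 with the specialization $\ell = \exp$, $\tau = 0$, $C_i = 1$, and $t_0 = 0$, it follows that

$$
\begin{aligned}
\sum_{i=1}^m \mathbb{1}\left[\frac{-e_i^\top A\lambda_{t+1}}{\|\lambda_{t+1}\|_1} \le \theta\right]
&\le m \prod_{i=1}^{t+1} \left((1-\gamma_i)^{1-\theta}(1+\gamma_i)^{1+\theta}\right)^{\nu/2} \\
&\le m \exp\left(\frac{\nu(t+1)}{2}\left(-\gamma^2 + \theta\gamma(2+\gamma)\right)\right).
\end{aligned}
$$

The rest of the result follows by noting $\theta < \gamma/(2+\gamma)$ implies $-\gamma^2 + \theta\gamma(2+\gamma) < 0$, whereby choices

$$t \ge \frac{2\ln(m)}{\nu\left(\gamma^2 - \theta\gamma(2+\gamma)\right)}$$

exist, and plugging this all in to the above bound grants that $M(A\lambda_t) \ge \theta$. □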
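
To make the iteration requirement concrete, the following sketch evaluates the threshold $2\ln(m) / (\nu(\gamma^2 - \theta\gamma(2+\gamma)))$ from the proof above and the corresponding exponential bound on the number of examples with margin at most $\theta$. The function names and the example parameter values are hypothetical and chosen only for illustration.

```python
import math


def iterations_for_margin(m, nu, gamma, theta):
    """Smallest integer t with t >= 2 ln(m) / (nu (gamma^2 - theta gamma (2 + gamma)))."""
    if not (0.0 <= theta < gamma / (2.0 + gamma)):
        raise ValueError("need 0 <= theta < gamma / (2 + gamma)")
    gap = gamma ** 2 - theta * gamma * (2.0 + gamma)
    return math.ceil(2.0 * math.log(m) / (nu * gap))


def margin_violation_bound(m, nu, gamma, theta, t):
    """Bound m * exp((nu (t + 1) / 2) * (-gamma^2 + theta gamma (2 + gamma)))."""
    gap = gamma ** 2 - theta * gamma * (2.0 + gamma)
    return m * math.exp(-nu * (t + 1) * gap / 2.0)
```

For instance, with `m=100`, `nu=0.1`, `gamma=0.2`, and `theta=0.05`, evaluating `margin_violation_bound` at `t = iterations_for_margin(...)` gives a value below 1, so no example can have margin at most $\theta$ at that point. The threshold scales as $1/\nu$, which quantifies the price of shrinkage in iteration count.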

Proof of Theorem 3.5. For $\alpha_t^{A}(\nu)$ and $\alpha_t^{O}(\nu)$, Lemma 3.8 and Lemma 3.9 already state the results in the desired asymptotic form.

For the other two, since $\ell \in \mathbb{L}_\infty$, $t_0$ can be chosen sufficiently large so that $C_{t_0}$ is arbitrarily close to 1, whereby the bounds in Lemma 3.6 and Lemma 3.7 become sufficiently tight by taking $\nu$ small and $t$ large. □

B.2.1. The Quantity $\Upsilon_\nu$

Definition B.4. Define

$$\Upsilon_\nu(\gamma) := \frac{\frac{2}{\nu}\ln(2) - \frac{2}{\nu}\ln\left((1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}\right) - \ln(1-\gamma^2)}{\ln(1+\gamma) - \ln(1-\gamma)};$$

in the case that $\nu = 1$, this quantity has been extensively studied in the context of AdaBoost's margins (Ratsch & Warmuth, 2005; Rudin et al., 2004; Schapire & Freund, 2012).

The basic properties of $\Upsilon_\nu$ are as follows.

Theorem B.5. Suppose $\gamma \in (0, 1)$.

1. $\gamma/2 \le \Upsilon_\nu(\gamma) \le \gamma$.

2. $\lim_{\nu \to 0} \Upsilon_\nu(\gamma) = \gamma$.

The bounds $\gamma/2 \le \Upsilon_1(\gamma) \le \gamma$ were known in the case that $\nu = 1$ (cf. Ratsch & Warmuth (2005) and Schapire & Freund (2012, Bibliographic Notes, Chapter 5)).
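
The two properties in Theorem B.5 are easy to probe numerically. The sketch below evaluates $\Upsilon_\nu(\gamma)$ directly from Definition B.4; the function name `upsilon` is an illustrative choice, and the direct formula is only expected to be numerically reliable when $\nu$ is not extremely small.

```python
import math


def upsilon(nu, gamma):
    """Evaluate Upsilon_nu(gamma) for nu in (0, 1] and gamma in (0, 1)."""
    mix = (1.0 + gamma) ** (1.0 - nu) + (1.0 - gamma) ** (1.0 - nu)
    numerator = (2.0 / nu) * (math.log(2.0) - math.log(mix)) - math.log(1.0 - gamma ** 2)
    return numerator / (math.log(1.0 + gamma) - math.log(1.0 - gamma))
```

Evaluating `upsilon(1.0, gamma)` recovers the classical AdaBoost quantity $-\ln(1-\gamma^2) / \ln((1+\gamma)/(1-\gamma))$, while decreasing `nu` moves the value from near $\gamma/2$ (for small $\gamma$) up toward $\gamma$, in line with the intuition that stronger shrinkage yields better margin guarantees.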

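Proof. (Item 1, subcase $\Upsilon_\nu(\gamma) \ge \gamma/2$.) To start, note that $(\cdot)^{1-\nu}$ is a concave function, whereby

$$
\begin{aligned}
-\frac{2}{\nu}\ln\left((1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}\right)
&= -\frac{2}{\nu}\ln\left(2\left(\tfrac{1}{2}(1+\gamma)^{1-\nu} + \tfrac{1}{2}(1-\gamma)^{1-\nu}\right)\right) \\
&\ge -\frac{2}{\nu}\ln\left(2(1)^{1-\nu}\right) \\
&= -\frac{2}{\nu}\ln(2).
\end{aligned}
$$

It follows that

$$\Upsilon_\nu(\gamma) \ge \frac{-\ln(1-\gamma^2)}{\ln(1+\gamma) - \ln(1-\gamma)} = \Upsilon_1(\gamma).$$

Next recall the series expansion

$$\ln(1+z) = \sum_{n=1}^\infty \frac{(-1)^{n+1}}{n} z^n$$

(when $|z| < 1$). Plugging this in to the simplified form of $\Upsilon_1(\gamma)$ and paying attention to cancellations in the numerator and denominator (odd and even terms, respectively),

$$
\begin{aligned}
\Upsilon_1(\gamma)
&= \frac{-\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n}\gamma^n - \sum_{n=1}^\infty \frac{(-1)^{n+1}}{n}(-\gamma)^n}{\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n}\gamma^n - \sum_{n=1}^\infty \frac{(-1)^{n+1}}{n}(-\gamma)^n} \\
&= \frac{-2\sum_{n=1}^\infty \frac{(-1)^{2n+1}}{2n}\gamma^{2n}}{2\sum_{n=1}^\infty \frac{1}{2n-1}\gamma^{2n-1}} \\
&= \frac{\gamma\sum_{n=1}^\infty \frac{1}{2n}\gamma^{2n-1}}{\sum_{n=1}^\infty \frac{1}{2n-1}\gamma^{2n-1}}.
\end{aligned}
$$

To finish, note that $n \ge 1$ implies $1/(4n-2) \le 1/(2n) \le 1/(2n-1)$, and thus

$$
\frac{\gamma}{2}
= \frac{\gamma\sum_{n=1}^\infty \frac{1}{2(2n-1)}\gamma^{2n-1}}{\sum_{n=1}^\infty \frac{1}{2n-1}\gamma^{2n-1}}
\le \frac{\gamma\sum_{n=1}^\infty \frac{1}{2n}\gamma^{2n-1}}{\sum_{n=1}^\infty \frac{1}{2n-1}\gamma^{2n-1}}
\le \frac{\gamma\sum_{n=1}^\infty \frac{1}{2n-1}\gamma^{2n-1}}{\sum_{n=1}^\infty \frac{1}{2n-1}\gamma^{2n-1}}
= \gamma.
$$

That is to say, $\gamma/2 \le \Upsilon_1(\gamma) \le \gamma$, which combined with the above also gives $\Upsilon_\nu(\gamma) \ge \Upsilon_1(\gamma) \ge \gamma/2$.

(Item 1, subcase $\Upsilon_\nu(\gamma) \le \gamma$.) By the power mean inequality (Steele, 2004, Equation 8.12),

$$\left(\frac{1+\gamma}{2}(1+\gamma)^{-\nu} + \frac{1-\gamma}{2}(1-\gamma)^{-\nu}\right)^{-1/\nu} \le (1+\gamma)^{\frac{1+\gamma}{2}}(1-\gamma)^{\frac{1-\gamma}{2}}.$$

It follows that

$$-\frac{2}{\nu}\ln\left(\frac{1+\gamma}{2}(1+\gamma)^{-\nu} + \frac{1-\gamma}{2}(1-\gamma)^{-\nu}\right) \le (1+\gamma)\ln(1+\gamma) + (1-\gamma)\ln(1-\gamma).$$

As such,

$$
\begin{aligned}
\Upsilon_\nu(\gamma)
&\le \frac{(1+\gamma)\ln(1+\gamma) + (1-\gamma)\ln(1-\gamma)}{\ln(1+\gamma) - \ln(1-\gamma)} + \frac{-\ln(1+\gamma) - \ln(1-\gamma)}{\ln(1+\gamma) - \ln(1-\gamma)} \\
&= \frac{\gamma\ln(1+\gamma) - \gamma\ln(1-\gamma)}{\ln(1+\gamma) - \ln(1-\gamma)} \\
&= \gamma.
\end{aligned}
$$

(Item 2.) Consider the (halved, negated) first term

$$\frac{\ln\left((1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}\right) - \ln(2)}{\nu} = \frac{\ln\left(0.5(1+\gamma)^{1-\nu} + 0.5(1-\gamma)^{1-\nu}\right)}{\nu}.$$

By l'Hopital's rule,

$$
\begin{aligned}
\lim_{\nu \to 0} \frac{\ln\left(0.5(1+\gamma)^{1-\nu} + 0.5(1-\gamma)^{1-\nu}\right)}{\nu}
&= \lim_{\nu \to 0} \frac{-(1+\gamma)^{1-\nu}\ln(1+\gamma) - (1-\gamma)^{1-\nu}\ln(1-\gamma)}{(1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}} \\
&= -\frac{1}{2}\left((1+\gamma)\ln(1+\gamma) + (1-\gamma)\ln(1-\gamma)\right).
\end{aligned}
$$

Consequently (recalling that this term was both halved and negated)

$$
\begin{aligned}
\lim_{\nu \to 0} \Upsilon_\nu(\gamma)
&= \frac{(1+\gamma)\ln(1+\gamma) + (1-\gamma)\ln(1-\gamma)}{\ln(1+\gamma) - \ln(1-\gamma)} + \frac{-\ln(1+\gamma) - \ln(1-\gamma)}{\ln(1+\gamma) - \ln(1-\gamma)} \\
&= \gamma.
\end{aligned}
$$

□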



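The usefulness of $\Upsilon_\nu$ is captured in the following lemma.

Lemma B.6. Let $\nu \in (0, 1]$ and $\theta \in [0, 1]$ be given. The map

$$\gamma \mapsto \frac{1}{2}\left(\frac{1+\gamma}{1-\gamma}\right)^{\theta\nu/2}(1-\gamma^2)^{\nu/2}\left((1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}\right)$$

is nonincreasing over $[\theta, 1]$. Additionally, now taking $\gamma$ to be fixed, $\theta < \Upsilon_\nu(\gamma)$ iff

$$\frac{1}{2}\left(\frac{1+\gamma}{1-\gamma}\right)^{\theta\nu/2}(1-\gamma^2)^{\nu/2}\left((1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}\right) < 1.$$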

Proof. Let 7(7) be the prescribed map. To establish / 
is nonincreasing, it will be shown that each element of 
the product f(-f) — g(j)h(j) is nonincreasing, where 



9j//2 



- o< 2 W 2 



'7 

fc( 7 ) := (l + 7) 1 -' + (l-7) 1 -'- 
First, set z/ := v/2, and note 

^( 7) = ^(l + 7r '(^)(l_ 7r '(^) 

= i/(l +6){l + 7 )' / ( 1+e )- 1 (l - 7 )-'( 1 - e ) 

- - 0)(1 - 7 )' / ( 1 - e >- 1 (l + 7 )' / ( 1+e ) 

= l .'(l + 7) ,y ' (1+£ ' ) - 1 (l-7) ,y ' (1 - £ ' ) - 1 
.((l + 0)(l-7)-(l-0)(l + 7)) 

= 2i/(l + 7 )^'( 1+e )- 1 (l - 7 )' / ( 1 - 6 )- 1 (0 - 7) , 

where this last term is nonpositive since 9 < 7. Con- 
sequently, (7(7) is nonincreasing. 

For /i(7), note similarly that 

ft'( 7 ) = (1-1/) ((1+7)-" -(1-7)-") 

= — (fl - 7)" - (l + 7)") 

(i + 7)"(i -i) v u 7j 1 7J ; 

< 0. 

Together /( 7 ) = g{j)h(j) is nonincreasing in 7. 
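For the second statement, note that

$$1 > \frac{1}{2}\left(\frac{1+\gamma}{1-\gamma}\right)^{\theta\nu/2}(1-\gamma^2)^{\nu/2}\left((1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}\right)$$

is equivalent to

$$0 > -\ln(2) + \frac{\theta\nu}{2}\ln\left(\frac{1+\gamma}{1-\gamma}\right) + \frac{\nu}{2}\ln(1-\gamma^2) + \ln\left((1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}\right),$$

which is equivalent to

$$\theta < \frac{\ln(2) - \frac{\nu}{2}\ln(1-\gamma^2) - \ln\left((1+\gamma)^{1-\nu} + (1-\gamma)^{1-\nu}\right)}{\frac{\nu}{2}\ln\left(\frac{1+\gamma}{1-\gamma}\right)},$$

where the last expression can be written $\theta < \Upsilon_\nu(\gamma)$.

□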



B.2.2. Miscellaneous Technical Material 

Lemma B.7. Suppose $A \in \{-1, +1\}^{m \times n}$ is binary and $\ell \in \mathbb{L}$. Then



2 



More simply, 



<iln(l±^ 
"2 U-7t 



ln(C7 t 4 ). 



|a?H-^M|<^ln(C t 4 ). 

Proof. Choose s E {±1} so that v t+ i — ejs for some 
ej. Then, by first order conditions on the optimal 
step size, and adopting shorthand notation where the 
summations take j fixed according to the preceding 
text, but i E [m] may vary, 

0= sA i:j i'{e] A{X t + sa t+1 e j )) 

Ai,<a 

+ sAi/iejAiXt + sat+^j)) 

Ay>0 

<C H \ s A? ex P( e 7 ~ A\ t ) exp(sa t+1 A i:j ) 

A i:j <Q 

+ c l+i s ^ ex p( e 7^ A t) cx p( sQ! t+i^) 

Ay>0 

< exp(-sa t+1 )C7 4 + 2 1 ^ sA^l' {ej AX t ) 

A i3 <0 

+ exp(sa t+1 )C t 2 +1 ^ sA l0 l' {ej AX t ), 

which can be rearranged to yield 

1. /-<iEA„<oV(e[A)\ 



sctt+i > - In 



2"V sC ?+iY,A ij>0 A ij ne]AX t ) 
1 



In 



'EA ii<0 nejAX t )\ i 



2 KEAtoot'WA* 
To simplify further, note that

S7t+1 ~ ||V£(AA t )||i 

||V£(ylAt)||i 
||V£(ylAt)||i 

which can be added and subtracted to yield 



£, >0 f(e7AA t ) i_ 



sit+i 



J2 Ai . <0 £'(ejAX t ) 1 + nt+i' 



whereby 



1 / 1 + s 7t+ i 
sat+i > - In 

2 V 1 - S7 t+1 



ln(C t 4 +1 )- 



Repeating the steps above to prove a lower bound on 
o.t+1, it also follows that 



1 / 1 + S7t+1 \ 1 4 

2 ^l-s 7t+i ; 2 ^ 



sa t +i < - In 

To finish the first part of the result, it suffices to con- 
sider the cases s = +1 and s = — 1 separately, which 
both lead to the desired pair of inequalities. 

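For the second guarantee, first note that $\alpha_t^{O}(\nu) = \nu\alpha_t^{O}(1)$ and $\alpha_t^{A}(\nu) = \nu\alpha_t^{A}(1)$, and so recalling the form of $\alpha_t^{A}(1)$ and scaling the first guarantee by $\nu$, it follows that

$$|\alpha_t^{O}(\nu) - \alpha_t^{A}(\nu)| \le \frac{\nu}{2}\ln(C_t^4).$$

□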

C. Deferred Material from Section 4 

Proof sketch of Theorem 4.3. All the convergence rates developed by Telgarsky (2012, Section 6) stem from an inequality

$$\mathcal{L}(A\lambda_{t+1}) - \bar{\mathcal{L}}_A \le \left(\mathcal{L}(A\lambda_t) - \bar{\mathcal{L}}_A\right)\left(1 - \frac{\|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty^2}{c\,\mathcal{L}(A\lambda_t)\left(\mathcal{L}(A\lambda_t) - \bar{\mathcal{L}}_A\right)}\right),$$

where $c > 0$ is some constant independent of $t$ (or improving with $t$, in which case the bound may be worsened by taking the choice for $t = 0$) (Telgarsky, 2012, Proposition 6.2, Proposition D.6). Exactly such a bound was provided for each line search in the proof of its respective optimization guarantee in the separable case (cf. Lemma 3.2, Lemma 3.3; no need to adjust Lemma 3.4, since $\ell = \exp$ and $A$ binary causes $\alpha_t^{A}(\nu) = \alpha_t^{O}(\nu)$, and so Lemma 3.2 covers this case). Replacing $c$ with the particulars for each step size will only impact the final rates in Theorems 6.3, 6.6, and 6.12 by these constants. The only other thing to check is that $\ell \in \mathbb{G}$, the class of losses considered by Telgarsky (2012, Section 6); it can be checked directly that $\mathbb{L} \subseteq \mathbb{G}$. □



In order to establish the margin properties, the following lemma is essential.

Lemma C.1. Consider the setting of Theorem 4.4. Then there exist $T$ and $\gamma$ so that, for all $t \ge T$,

$$\frac{\|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty}{\mathcal{L}(A\lambda_t) - \bar{\mathcal{L}}_A} \ge \gamma.$$

Proof sketch. As discussed in the proof of Theo- 
rem 4.3, the results of Telgarsky (2012), which are 
superficially specialized to the Wolfe line search, carry 
over for the other line searches here with only a change 
of constants; consequently, those results carry over 
wholesale. 

To start, let S be a compact cube containing all iter- 
ates, and let j(A, S) be the corresponding generalized 
weak learning rate Telgarsky (2012, Definition 4.3). 

By (Telgarsky, 2012, Theorem 5.9), C + t im(j4+) (i.e., 
the function which is C(y) when y = A + X for some 
A € K™, and oo otherwise) has compact level sets, and 
thus strict convexity of C grants a modulus of strong 
convexity c > over S; furthermore, it holds for every 
t that 



C(A+X t )-C A 

WC(A + X t ) - P^ (s)nkw(A T ) (V/:(A + A t )) 



1 

< — 
- 2c 



where F'v£(S)nker( J 4 T ) denotes the I 1 projection onto 
V£(S) n ker(A^), the latter being the kernel 
(nullspace) of ^4^ (Telgarsky, 2012, Lemma 6.8). 

Now choose T so that, for every t >T, 



C(A+X t ) ~£ A < C(AX t ) - £ < 2c, 

which is possible by the convergence of {At}^ (cf. 
Theorem 4.3 or (Telgarsky, 2012, Theorem 6.12)). 

Using these facts, the definition of j(A, S), the choice 
4> = exp, and the fact inf \ £(A + X) = inf \ £(AX) = £ A 
(Telgarsky, 2012, Theorem 5.9),
||A T V£(^A t )||oo 



£{AX t )-£ A 



>i(AS) 



\V£(A\ t )~Pl nk ^ AT) (V£(A\ t ))\\ 



£{AX t 



C 



A 



l(A,S) 



Cf ( ||V£(A )A t )||i 
\£{AX t ) 

V£(A + A t ) - P 1 



C^\£{AX t )-£ A 



£{A\ t ) - £ A 



> 



7(A,S) f £(A X t ) + y/2c(C(A+X t ) - 



Crp 



£{AX t ) - £, 



j(A,S) ( £(A X t 
' \£(AX t )- 



£, 



y/(C(A+X t ) - £ A )(£(A+X t ) ~ £ A ) 



£{AX t 



To finish, set 7 := j(A, S)/Cf . 



Another technical lemma is helpful. 

Lemma C.2. Consider the setting of Theorem 4.4. For each step size choice and $B > 0$, there exists $T_B$ so that for all $t \ge T_B$, $\|\lambda_t\|_1 \ge B$.

Proof sketch. This follows from Theorem 4.3 and 
|-ff(A)| < to. In particular, choose any example 
i € H(A) C ; there exists e > so that ^(e 4 ^ T A) < e 
(which is a necessary condition for £(AX) < e) only 
when e^^lA < —BWejAW^, and so the result follows 
by combining this with Holder's inequality, namely 
the inequality — ej AX < ||e^^4.|| 00 1| A|| 1 ; the optimality 
guarantee provides that this holds for all large t. □ 

In order to prove the margin results, it is helpful to
split into two cases, one being the Wolfe step sizes, 
the other being a generalization of the quadratic upper 
bound step sizes. 

Lemma C.3. Consider the setting of Theorem 4.4, but with step sizes $0.5\alpha_t^{Q}(\nu) \le \alpha_t \le 1.5\alpha_t^{Q}(\nu)$. Then there exists $\gamma > 0$ and $T$ so that, for all $t \ge T$, all margins (over $H(A)^c$) exceed $\gamma$.



symmetry grants that Q.ba^iv) is guaranteed to be 
a worse choice than anything in the specified interval. 
As such, plugging this in to the quadratic upper bound 
yields 



£(AX t+1 ) < £(AX t ) 



c 7 t+ iP T V/:(^A t )|| 



for some constant cq > depending on C\ and not on 
t. 

Now choose T\ according to Lemma C.l; by the above 
and Lemma C.l, for any t >T\, 



£{AX t+1 )-£ A 

< £(AX t ) -C A - 

< (£(AX t ) - £ A ) (l 

< (£(AX t ) - £ A ) 1 1 



7 t+ ic HA T V£(^A t )|| oo 
2 

' 7 t +iCoP T V£(^A f )|U 
2(£(AX t -£ A ) 

7t+i c o7 



□ which, after recursive application, provides 



£(AX t+1 )-£ A <(£(AX Tl )-£ A )ex P \-^ £ 7l 

i=Ti+l 



t+l 



Since 



t+i 



|A f +i||i = ||A Tl + a * v *h 

t+l 

<I|At 1 ||i+ J2 ^ 



»=Ti+l 



t+l 



<||A Tl ||i + 1.5i/ J2 V> 

i=Ti+l 



it follows that 



£{AX t+ x)-£ A 



'• (^i-l-V/, £ x )), H ,{-^(\\\ t+1 \\ 1 -\\\ Tl \\ 1 ) 



Proof sketch. Consider the quadratic upper bound line 
search in Lemma 3.2 and its proof. It is unclear 
whether 0.5a^(V) or 1.5a^(u) give a better step, due 
to the term v. However, since <x^(l) is the minimizer, 



For any iteration $t$, let $b_t$ index any example in $[m] \setminus H(A)$ which achieves the worst margin (amongst elements off the hard core) for this iteration. Since the optimal error on this example is 0 (Telgarsky, 2012, Theorem 5.9), for any $t \ge T_1$,

ey^{bjAX t ) 

= exp{bjAX t ) - 

<mC Tl (L{AX t )-C A ) 



< mC Tl (C(A\ Tl -£ A ))exp f-^dlAtHi - ||A Tl ||i) 

= exp (-7Co||A t ||i/(3i/)) 
• C t (C(AX Tl - L A )) ex P ( 7C o||A Tl ||i/(3i/)) . 

v v ' 

— :cxp(r) 

Applying In and rearranging, 

-bjAX t •yco r 



l|At||i " 3^ ||At||i' 

To finish, by Lemma C.2, there exists so that 

i, , n 6rz/ 
Ml > — 

7A) 

for every i > Ta, and setting T := max{Ti,T2} gives 
the desired result. □ 

Lemma C.4. Consider the setting of Theorem 4.4, but specialized so that $\alpha_t \in \alpha_t^{W}(\nu)$. Then there exists $\gamma > 0$ and $T$ so that, for all $t \ge T$, all margins exceed $\gamma$.

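Proof sketch. Choose $T_1$ according to Lemma C.1, and let $t \ge T_1$ be arbitrary. Using Lemma C.1, and using the first Wolfe condition (eq. (2.7)) just as in the proof of Lemma 3.7,

$$
\begin{aligned}
\mathcal{L}(A\lambda_{t+1}) - \bar{\mathcal{L}}_A
&\le \mathcal{L}(A\lambda_t) - \bar{\mathcal{L}}_A - \alpha_{t+1}(1 - \nu/2)\|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty \\
&= \left(\mathcal{L}(A\lambda_t) - \bar{\mathcal{L}}_A\right)\left(1 - \frac{\alpha_{t+1}(1 - \nu/2)\|A^\top \nabla\mathcal{L}(A\lambda_t)\|_\infty}{\mathcal{L}(A\lambda_t) - \bar{\mathcal{L}}_A}\right) \\
&\le \left(\mathcal{L}(A\lambda_t) - \bar{\mathcal{L}}_A\right)\left(1 - \alpha_{t+1}(1 - \nu/2)\gamma\right) \\
&\le \left(\mathcal{L}(A\lambda_t) - \bar{\mathcal{L}}_A\right)\exp\left(-\alpha_{t+1}(1 - \nu/2)\gamma\right).
\end{aligned}
$$

Applying this inequality recursively,

$$\mathcal{L}(A\lambda_{t+1}) - \bar{\mathcal{L}}_A \le \left(\mathcal{L}(A\lambda_{T_1}) - \bar{\mathcal{L}}_A\right)\exp\left(-(1 - \nu/2)\gamma\sum_{i=T_1+1}^{t+1}\alpha_i\right).$$

Note next, for any $t_0$, that

$$\|\lambda_{t+1}\|_1 \le \|\lambda_{t_0}\|_1 + \sum_{i=t_0+1}^{t+1}\alpha_i,$$

whereby

$$\mathcal{L}(A\lambda_{t+1}) - \bar{\mathcal{L}}_A \le \left(\mathcal{L}(A\lambda_{T_1}) - \bar{\mathcal{L}}_A\right)\exp\left(-(1 - \nu/2)\gamma\left(\|\lambda_{t+1}\|_1 - \|\lambda_{t_0}\|_1\right)\right),$$

and the remainder of the proof proceeds just as for the quadratic upper bound (cf. Lemma C.3). □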

Proof sketch of Theorem 4.4. The cases of $\alpha_t^{Q}(\nu)$ and $\alpha_t^{W}(\nu)$ are handled by Lemma C.3 and Lemma C.4.



Now consider the case of af{u). Since 7 



0. 



Lemma C.6 grants the existence of a large $T$ so that, for all $t \ge T$, $\gamma_t \le 0.1$. Thus, by Lemma C.5, and considering $t$ sufficiently large that $C_t$ is almost 1, the problem reduces to the consideration of $\alpha_t^{Q}(\nu)$; in particular, the conditions to apply Lemma C.3, but now for the step $\alpha_t^{A}(\nu)$, are satisfied. Note that this also handles the case $\alpha_t^{O}(\nu)$, since, for $\alpha_t^{O}(\nu)$ and $\alpha_t^{A}(\nu)$, it was assumed that $A$ is binary and $\ell = \exp$. □

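C.1. Miscellaneous Technical Material

Lemma C.5. For any $r \in [0, 1)$,

$$r \le \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \le \frac{r}{1-r}.$$

Proof. Set $g(r) := \frac{1}{2}\ln((1+r)/(1-r))$. Note that $g'(r) = (1-r^2)^{-1}$ and $g''(r) = 2r(1-r^2)^{-2}$. As such, $g$ is convex (along $[0, 1)$) and $g'(0) = 1$, thus $g(r) \ge r$ along $[0, 1)$. The second part follows from concavity of $\ln(\cdot)$:

$$\frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \frac{1}{2}\ln\left(1 + \frac{2r}{1-r}\right) \le \frac{1}{2}\left(\frac{2r}{1-r}\right) = \frac{r}{1-r}.$$

□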
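
The sandwich in Lemma C.5 is what allows a step of the form $\frac{\nu}{2}\ln((1+\gamma_t)/(1-\gamma_t))$ to be compared against one proportional to $\nu\gamma_t$ once $\gamma_t$ is small. A quick numerical check over a grid is sketched below; the helper names and the tolerance are illustrative assumptions, not part of the paper.

```python
import math


def half_log_ratio(r):
    """g(r) = (1/2) ln((1 + r) / (1 - r)) for r in [0, 1)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def sandwich_holds(r, tol=1e-12):
    """Check r <= g(r) <= r / (1 - r) up to floating-point tolerance."""
    g = half_log_ratio(r)
    return r - tol <= g <= r / (1.0 - r) + tol
```

In particular, at $r = 0.1$ the value `half_log_ratio(0.1)` lies between $0.1$ and $0.15$, which is the kind of two-sided control by a constant multiple of $r$ that the proof sketch of Theorem 4.4 relies on.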



Lemma C.6. Under the conditions of Theorem 4.4, $\lim_{t \to \infty} \gamma_t = 0$.

Proof sketch. As discussed in the proof of Theo- 
rem 4.3, every step size provides a guarantee of the 
type 

C(AX t+1 ) < C(AX t ) - 

c 

for some $c > 0$ (independent of $t$). The result follows by rearranging this expression and using $\mathcal{L}(A\lambda_t) \ge \bar{\mathcal{L}}_A > 0$ (i.e., nonseparability) and $\mathcal{L}(A\lambda_t) - \mathcal{L}(A\lambda_{t+1}) \to 0$ (i.e., the convergence result, Theorem 4.3). □



